# [trainer, worker] feat: more flexible and easy-to-use reward model #3679
## Conversation
> Does sglang …

Yes, I checked the related issues and found that the same phenomenon was mentioned in sgl-project/sglang#6367 (comment). In short, since RL normally uploads a new set of parameters, sglang simply discards the old ones to speed up. I also looked into the recent PR sgl-project/sglang#10873, which seems to add support for reusing the original weights by keeping a stored copy.
The PR seems to be included in sglang 0.5.3. |
The current reward model implementation faces the following challenges:

1. Model support: it is primarily designed for discriminative models and lacks robust support for generative reward models.
2. Complexity: it relies on heavyweight backends like FSDP or Megatron, which are often unnecessary for typical reward model inference tasks.
3. Flexibility: the batch-level synchronization mechanism hinders the implementation of more flexible, sample-level reward functions for developers.

### What this PR does

To address these issues, this PR introduces a more flexible and easy-to-use reward model design. Specifically, it implements two main classes, `RewardModelManager` and `RewardManagerWorker`, along with runnable scripts in `recipe/fapo`.

<img width="1732" height="1188" alt="image" src="https://github.com/user-attachments/assets/50fa8358-483c-44af-ba7b-3b696306c3db" />

- `RewardModelManager` first launches multiple reward servers and then adopts a router-based approach to manage them (using the [SGLang Router](https://docs.sglang.ai/advanced_features/router.html)), distributing requests across the reward servers.
- `RewardManagerWorker` retrieves the remote actor handle, giving users greater flexibility in designing custom reward functions. For example, users can easily implement a customized reward function like the following:

```python
async def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict,
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer,
):
    # Compute the rule-based reward score
    rule_based_score = ...

    # Build the GRM prompts
    grm_prompts = ...
    grm_prompt_ids = ...

    # Users can directly call the reward model via a POST request to the reward router
    grm_outputs = post(f"http://{reward_router_address}/generate", ...)
    ...

    # Final reward score
    final_score = ...
    return final_score
```

This implementation exposes a `reward_model` interface inside the `compute_score` method, maximizing flexibility and convenience for algorithmic design. Note that `compute_score` is an asynchronous function, so efficiency is not a concern: each sample is processed asynchronously.
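The `post(...)` call above is left abstract. As a rough sketch of that step, the request might be issued with `aiohttp`, assuming the router forwards SGLang's native `/generate` API and accepts a JSON body with `text` and `sampling_params` fields; the helper name, payload shape, and sampling parameters are illustrative assumptions, not part of this PR.

```python
import aiohttp

async def query_reward_router(reward_router_address: str, grm_prompt: str) -> str:
    """Hypothetical helper: send one GRM prompt to the reward router and return the generation."""
    # Payload shape assumes SGLang's native /generate schema; adjust to the actual server API.
    payload = {
        "text": grm_prompt,
        "sampling_params": {"max_new_tokens": 512, "temperature": 0.0},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"http://{reward_router_address}/generate", json=payload
        ) as resp:
            resp.raise_for_status()
            result = await resp.json()
    # SGLang generate responses typically carry the generated text under "text".
    return result["text"]
```

Inside `compute_score`, the text returned by such a helper would then be parsed into a numeric GRM score and combined with the rule-based score.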
### Integration with AgentLoop
This PR introduces asynchronous reward computation for individual samples (`async def run_single(self, data: DataProto) -> dict`) and leverages an event loop to run reward computation in parallel, significantly improving processing efficiency.

Moreover, this implementation integrates with `agentloop` for further efficiency gains (already implemented):

<img width="2362" height="1280" alt="image" src="https://github.com/user-attachments/assets/4297428d-194b-4c6f-aff1-69daf02ca743" />

In this mode, the reward model operates independently from the rollout process (standalone mode), enabling a natural asynchronous data flow in which each sample undergoes reward rollout immediately after its actor rollout.

With this implementation, code redundancy in the existing reward model is reduced while flexibility for user-customized reward functions is maximized.
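As a minimal sketch of the event-loop pattern described in this section, per-sample reward calls can simply be scheduled concurrently and gathered. The `run_single` interface is quoted from the description above; the worker and sample objects here are illustrative stand-ins rather than the actual verl types.

```python
import asyncio

async def compute_rewards_concurrently(reward_worker, samples) -> list[dict]:
    # Schedule one run_single(...) coroutine per sample; the calls overlap on the
    # event loop while each one awaits its reward-router response.
    tasks = [asyncio.create_task(reward_worker.run_single(sample)) for sample in samples]
    return list(await asyncio.gather(*tasks))
```

With the `agentloop` integration, the same idea applies per sample: a reward task can be awaited as soon as that sample's actor rollout finishes, instead of waiting for the whole batch.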
### Runnable Scripts
A runnable example is provided in `recipe/fapo/`. The newly introduced parameters for this implementation are placed in `fapo/config` and will be integrated into the main codebase upon completion of the refactoring.